AITopics | actual performance

Collaborating Authors

actual performance

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Do Code Models Suffer from the Dunning-Kruger Effect?

Singh, Mukul, Chatterjee, Somya, Radhakrishna, Arjun, Gulwani, Sumit

arXiv.org Artificial IntelligenceOct-8-2025

As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape our shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency for those with limited competence to overestimate their abilities in state-of-the-art LLMs in coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models and those operating in rare programming languages exhibit stronger DKE-like bias, suggesting that the strength of the bias is proportionate to the competence of the models.

large language model, machine learning, programming language, (17 more...)

arXiv.org Artificial Intelligence

2510.05457

Genre:

Overview (1.00)
Research Report > New Finding (0.94)

Technology:

Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.94)
(2 more...)

Add feedback

Bayesian Risk Markov Decision Processes

Neural Information Processing SystemsAug-15-2025, 17:27:03 GMT

Markov decision process (MDP) is a paradigm for modeling sequential decision making under uncertainty. From a modeling perspective, some parameters of MDPs are unknown and need to be estimated from data.

artificial intelligence, br-mdp, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Georgia > Fulton County > Atlanta (0.04)
North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > Portugal > Braga > Braga (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

When Is Prior Knowledge Helpful? Exploring the Evaluation and Selection of Unsupervised Pretext Tasks from a Neuro-Symbolic Perspective

Jia, Lin-Han, Han, Si-Yu, Hu, Wen-Chao, Shao, Jie-Jing, Wei, Wen-Da, Zhou, Zhi, Guo, Lan-Zhe, Li, Yu-Feng

arXiv.org Artificial IntelligenceAug-12-2025

Neuro-symbolic (Nesy) learning improves the target task performance of models by enabling them to satisfy knowledge, while semi/self-supervised learning (SSL) improves the target task performance by designing unsupervised pretext tasks for unlabeled data to make models satisfy corresponding assumptions. We extend the Nesy theory based on reliable knowledge to the scenario of unreliable knowledge (i.e., assumptions), thereby unifying the theoretical frameworks of SSL and Nesy. Through rigorous theoretical analysis, we demonstrate that, in theory, the impact of pretext tasks on target performance hinges on three factors: knowledge learn-ability with respect to the model, knowledge reliability with respect to the data, and knowledge completeness with respect to the target. We further propose schemes to operationalize these theoretical metrics, and thereby develop a method that can predict the effectiveness of pretext tasks in advance. This will change the current status quo in practical applications, where the selections of unsupervised tasks are heuristic-based rather than theory-based, and it is difficult to evaluate the rationality of unsupervised pretext task selection before testing the model on the target task. In experiments, we verify a high correlation between the predicted performance--estimated using minimal data--and the actual performance achieved after large-scale semi-supervised or self-supervised learning, thus confirming the validity of the theory and the effectiveness of the evaluation method.

artificial intelligence, knowledge, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2508.07299

Country: Asia > China (0.15)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.75)

Add feedback

Interpretable Droplet Digital PCR Assay for Trustworthy Molecular Diagnostics

Wei, Yuanyuan, Wu, Yucheng, Qu, Fuyang, Mu, Yao, Ho, Yi-Ping, Ho, Ho-Pui, Yuan, Wu, Xu, Mingkun

arXiv.org Artificial IntelligenceJan-15-2025

Accurate molecular quantification is essential for advancing research and diagnostics in fields such as infectious diseases, cancer biology, and genetic disorders. Droplet digital PCR (ddPCR) has emerged as a gold standard for achieving absolute quantification. While computational ddPCR technologies have advanced significantly, achieving automatic interpretation and consistent adaptability across diverse operational environments remains a challenge. To address these limitations, we introduce the intelligent interpretable droplet digital PCR (I2ddPCR) assay, a comprehensive framework integrating front-end predictive models (for droplet segmentation and classification) with GPT-4o multimodal large language model (MLLM, for context-aware explanations and recommendations) to automate and enhance ddPCR image analysis. This approach surpasses the state-of-the-art models, affording 99.05% accuracy in processing complex ddPCR images containing over 300 droplets per image with varying signal-to-noise ratios (SNRs). By combining specialized neural networks and large language models, the I2ddPCR assay offers a robust and adaptable solution for absolute molecular quantification, achieving a sensitivity capable of detecting low-abundance targets as low as 90.32 copies/{\mu}L. Furthermore, it improves model's transparency through detailed explanation and troubleshooting guidance, empowering users to make informed decisions. This innovative framework has the potential to benefit molecular diagnostics, disease research, and clinical applications, especially in resource-constrained settings.

droplet, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2501.09218

Country:

Asia > China (0.69)
Europe (0.67)
North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.86)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.93)

Add feedback

Performance Prediction for Multi-hop Questions

Samadi, Mohammadreza, Rafiei, Davood

arXiv.org Artificial IntelligenceAug-11-2023

We study the problem of Query Performance Prediction (QPP) for open-domain multi-hop Question Answering (QA), where the task is to estimate the difficulty of evaluating a multi-hop question over a corpus. Despite the extensive research on predicting the performance of ad-hoc and QA retrieval models, there has been a lack of study on the estimation of the difficulty of multi-hop questions. The problem is challenging due to the multi-step nature of the retrieval process, potential dependency of the steps and the reasoning involved. To tackle this challenge, we propose multHP, a novel pre-retrieval method for predicting the performance of open-domain multi-hop questions. Our extensive evaluation on the largest multi-hop QA dataset using several modern QA systems shows that the proposed model is a strong predictor of the performance, outperforming traditional single-hop QPP models. Additionally, we demonstrate that our approach can be effectively used to optimize the parameters of QA systems, such as the number of documents to be retrieved, resulting in improved overall retrieval performance.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2308.06431

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.14)
North America > United States > New York > New York County > New York City (0.04)
Asia > China > Hong Kong (0.04)
(6 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment (0.93)
Media > Film (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.93)

Add feedback

From Contextual Data to Newsvendor Decisions: On the Actual Performance of Data-Driven Algorithms

Besbes, Omar, Ma, Will, Mouchtaki, Omar

arXiv.org Artificial IntelligenceJul-27-2023

In this work, we explore a framework for contextual decision-making to study how the relevance and quantity of past data affects the performance of a data-driven policy. We analyze a contextual Newsvendor problem in which a decision-maker needs to trade-off between an underage and an overage cost in the face of uncertain demand. We consider a setting in which past demands observed under ``close by'' contexts come from close by distributions and analyze the performance of data-driven algorithms through a notion of context-dependent worst-case expected regret. We analyze the broad class of Weighted Empirical Risk Minimization (WERM) policies which weigh past data according to their similarity in the contextual space. This class includes classical policies such as ERM, k-Nearest Neighbors and kernel-based policies. Our main methodological contribution is to characterize exactly the worst-case regret of any WERM policy on any given configuration of contexts. To the best of our knowledge, this provides the first understanding of tight performance guarantees in any contextual decision-making problem, with past literature focusing on upper bounds via concentration inequalities. We instead take an optimization approach, and isolate a structure in the Newsvendor loss function that allows to reduce the infinite-dimensional optimization problem over worst-case distributions to a simple line search. This in turn allows us to unveil fundamental insights that were obfuscated by previous general-purpose bounds. We characterize actual guaranteed performance as a function of the contexts, as well as granular insights on the learning curve of algorithms.

actual performance, data-driven algorithm, newsvendor decision, (1 more...)

arXiv.org Artificial Intelligence

2302.08424

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.53)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Nearest Neighbor Methods (0.53)

Add feedback

Dynamic Ensemble of Low-fidelity Experts: Mitigating NAS "Cold-Start"

Zhao, Junbo, Ning, Xuefei, Liu, Enshu, Ru, Binxin, Zhou, Zixuan, Zhao, Tianchen, Chen, Chen, Zhang, Jiajin, Liao, Qingmin, Wang, Yu

arXiv.org Artificial IntelligenceFeb-2-2023

Predictor-based Neural Architecture Search (NAS) employs an architecture performance predictor to improve the sample efficiency. However, predictor-based NAS suffers from the severe ``cold-start'' problem, since a large amount of architecture-performance data is required to get a working predictor. In this paper, we focus on exploiting information in cheaper-to-obtain performance estimations (i.e., low-fidelity information) to mitigate the large data requirements of predictor training. Despite the intuitiveness of this idea, we observe that using inappropriate low-fidelity information even damages the prediction ability and different search spaces have different preferences for low-fidelity information types. To solve the problem and better fuse beneficial information provided by different types of low-fidelity information, we propose a novel dynamic ensemble predictor framework that comprises two steps. In the first step, we train different sub-predictors on different types of available low-fidelity information to extract beneficial knowledge as low-fidelity experts. In the second step, we learn a gating network to dynamically output a set of weighting coefficients conditioned on each input neural architecture, which will be used to combine the predictions of different low-fidelity experts in a weighted sum. The overall predictor is optimized on a small set of actual architecture-performance data to fuse the knowledge from different low-fidelity experts to make the final prediction. We conduct extensive experiments across five search spaces with different architecture encoders under various experimental settings. Our method can easily be incorporated into existing predictor-based NAS frameworks to discover better architectures.

architecture, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2302.00932

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre:

Research Report (0.64)
Workflow (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Add feedback

How to test ML models in the real world

#artificialintelligenceJun-3-2022, 20:08:11 GMT

How often do you test ML models in a Jupyter notebook, get good results, but still cannot convince your boss that the model should be used right away? Or maybe you manage to convince her and put the model in production, but you do not see any impact on business metrics? Luckily for you, there are better ways to test ML models in the real world and to convince everyone (including you) that they add value to the business. In this article you will learn what these evaluation methods are, how to implement them, and when should you use each. We, data scientists and ML engineers, develop and test ML models in our local development environment, for example, a Jupyter notebook.

ml model, portfolio, test ml model, (14 more...)

#artificialintelligence

Genre: Instructional Material (0.55)

Industry: Banking & Finance (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Why the high accuracy in classification is not always correct?

#artificialintelligenceMay-22-2022, 04:00:38 GMT

Classification accuracy is a statistic that describes a classification model's performance by dividing the number of correct predictions by the total number of predictions. It is simple to compute and comprehend, making it the most often used statistic for assessing classifier models. But not in every scenario accuracy score is to be considered the best metric to evaluate the model. In this article, we will discuss the reasons not to believe in the accuracy performance parameter completely. Following are the topics to be covered.

accuracy, classification, prediction, (16 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)

Add feedback

AI in Medicine -- Prospective versus Retrospective

#artificialintelligenceApr-3-2022, 09:15:06 GMT

Just like Sedol Lee was defeated by AlphaGo three or four years ago, there was an atmosphere that artificial intelligence would replace experts in medicine and replace everything in the world. The achievements of AI in the medical field were recorded one by one in an IEEE Spectrum ("AI versus Doctor"; https://ieeexplore.ieee.org/document/8048826). However, since a year or two ago, the main focus has moved to the role of artificial intelligence as an assistance tool for experts, and recently, it is not uncommon to hear that artificial intelligence is not making a profit in business. Even IBM's Watson was sold with some criticism. There may be a problem in some way, so why are we hearing these news?

algorithm, dermatologist, specialist, (15 more...)

#artificialintelligence

Genre: Research Report > Experimental Study (0.73)

Industry: Health & Medicine > Therapeutic Area (0.57)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.70)

Add feedback